Minimal representations of possibility at age 3

Young children do not always consider alternative possibilities when planning. Suppose a prize is hidden in a single occluded container and another prize is hidden in an occluded pair. If given a chance to choose one container and receive its contents, choosing the singleton maximizes expected reward because each member of the pair might be empty. Yet, 3-y-olds choose a member of the pair almost half the time. Why don’t they maximize expected reward? Three studies provide evidence that 3-y-olds do not deploy possibility concepts like MIGHT, which would let them represent that each container in the pair might and might not contain a prize. Rather, they build an overly specific model of the situation that correctly specifies that the singleton holds a prize while inappropriately specifying which member of the pair holds a prize and which is empty. So, when asked to choose a container, they see two equally good options. This predicts approximately 50% choice of the singleton, observed in studies 1 and 3. But when asked to throw away a container so that they can receive the remaining contents (study 2), they mostly throw away a member of the pair. The full pattern of data is expected if children construct overly specific models. We discuss whether 3-year-olds lack possibility concepts or whether performance demands prevent deployment of them in our tasks.

We sought to replicate the in-lab finding that 3-year-olds pick wisely (i.e., pick the singleton chest) on 60% of Pick 1 of 3 trials in the 3-container task (1,2).
Warmup. In the warmup, six differently-colored treasure chests appeared on-screen. The child was asked to name the color of each chest. The goal of the task was interactivity, not accuracy, so any answer was accepted.
Training phase. In the training phase, children learned to open a chest and find a coin under conditions of perfect information: every time they chose a chest, they knew the location of every coin.
Training phase 1: Single chest. The first training trial familiarized children with the goal of the game: to open a chest that held a coin. This trial started with one chest on the screen. A coin appeared above it. The chest opened and the coin moved into the chest, which then closed.
The child was asked where the coin had been hidden. Once the child indicated the chest, it opened and revealed the coin.
Training phase 2: Two chests. In the next training trial two chests appeared on screen.
Then a coin appeared above the chests. The chests opened, the coin moved into one of the chests, and the chests closed. Then the child was asked which chest she or he wanted to open.
The contents of the selected chest were then revealed. All children saw at least two trials.

Training phase 3: Demonstration of appropriate reasoning.
In these trials we told children how to solve the problem (to pick a chest that was sure to hold a coin) without teaching them to pick a singleton chest. Two pairs of chests appeared. One pair was occluded by a pirate flag. Two coins appeared, one above each occluded chest. These coins simultaneously descended directly downward so that children could tell that one coin went into each occluded chest. The occluder was removed, and the other pair was occluded. One coin appeared above the space between the two occluded chests; it descended and was hidden behind the flag in one of the two chests. The occluder was removed. It was not clear which of those two chests held the coin. After all coins were hidden, the experimenter indicated the side where two coins were hidden and told the child that if he picked a chest on that side he could be certain that he would win a coin, but that if he picked a chest on the other side, he could not be certain that he would win a coin. The experimenter then opened a chest on the side with two coins. Thus he told children how to solve the problem and illustrated the solution, without modeling a behavior that could be copied in the 3-chest test trials. The demonstration was repeated, changing only which pair of chests held two coins. Three variables were counterbalanced between participants: the side where coins were hidden first (L/R), the side first indicated and discussed by the experimenter (L/R), and the location of the two-coin side in the first demonstration (L/R).
Test phase. Two sets of chests appeared: a singleton and a pair (main text, Fig. 1). One set was occluded and a coin was hidden there, and the occluder disappeared. Then the other set was occluded and a second coin was hidden there, and the occluder disappeared. The child was asked which chest they were "really sure" held a coin. The chosen chest opened and revealed its contents. Children saw four trials. Three variables were counterbalanced between participants: the location of the singleton chest in the first trial (L/R), the positioning of the coin in the pair (LRRL / RLLR), and the side where coins were hidden first (L/R). The side (L/R) where coins were hidden first was constant within participants across trials. The singleton and the pair swapped sides on each trial.

Participants
Participants were 24 3-year-olds (mean = 3.63, range = 3.14-3.94, 10 female). Data from eight additional children were excluded: three failed to complete the study, four made consecutive errors in the three-chest training, and one made consecutive errors in the flag training (see below).

Procedure
Warmup. The warmup was the same as in Study 1.
Training phase. In the training phase children learned to throw away a chest, and that after doing so they received the contents of all remaining chests.
Training phase 1: Single chest demonstration. One single-chest demonstration introduced the child to the action of throwing away a chest. A single chest appeared on the screen. The experimenter explained that in this game, the child would pick one chest to throw away, and then win any coins left behind. The experimenter demonstrated throwing a chest away by clicking on the chest, causing it to fade out.
Training phase 2: Two chests. Next, the child practiced throwing away an empty chest in order to win all coins that remained. Two chests appeared on the screen. The chests opened, a coin entered one of the chests, and the chests closed. The child was asked which chest they wanted to throw away. The chosen chest faded out; the other chest opened to reveal its contents.
All children saw at least two trials.
Training phase 3: Three chests. The three-chest training trial encouraged the child to maximize the number of coins they received. Three equally spaced chests appeared on screen.
The chests opened, a coin went into one of the chests, and the chests closed. Then the two empty chests opened, a coin went into one of them, and the chests closed. The child was asked which chest they wanted to throw away. The chosen chest faded out, and the remaining two chests revealed their contents in turn.

Training phase 4: Flag training.
The flag training trial familiarized children with flag occluders. Two chests appeared on the screen. Two flags appeared, one occluding each chest, with a gap between the flags. A coin descended behind one of the occluders, and the occluders disappeared, revealing only the two chests. The coin had to be inside one of the chests, and since there was a gap between the flags, children could infer which chest it was in. The child was asked which chest they wanted to throw away. The chosen chest faded out, and the remaining chest opened to reveal its contents.
Training phase 5: Certainty-scaffolding demonstration. Penultimately, one demonstration trial encouraged children to attend to their own certainty. Three equally spaced chests were occluded simultaneously with a single large pirate flag, and a coin went behind the flag. The experimenter told the child that the computer was going to hide a coin in one of the chests.
The flag was removed to show the three chests, each closed and hiding its contents, and the experimenter explained that he could not be sure where the coin was. This procedure was repeated two more times. After the third coin, the experimenter explained that he now could be sure where the coins were, as there had to be one coin in each chest.
Test phase. Two sets of chests appeared: a singleton and a pair (main text, Fig. 1). One set was occluded and a coin was hidden there, and the occluder disappeared. Then the other set was occluded and a second coin was hidden there, and the occluder disappeared. The experimenter reminded the child to try to win two coins. The child was then asked to pick a chest to throw away. The chosen chest faded out, and the remaining two chests opened to reveal their contents. Children saw eight trials. Three variables were counterbalanced between participants:

Training phase. In the training phase children learned to throw away a chest, and then to pick a chest to open.

Training phases 1-3: Single chest demonstration, Two chests, and Flag training.
All of these training trials were the same as in Study 2, except that when the chest was thrown away, a red X appeared to mark its former location.
Test phase. Two sets of chests appeared: a singleton and a pair (main text, Fig. 1). One set was occluded and a coin was hidden there, and the occluder disappeared. Then the other set was occluded and a second coin was hidden there, and the occluder disappeared. The experimenter asked the child which treasure chest they wanted to throw away. The selected chest faded out. A red X appeared in its place to remind the child of the structure of the hiding phase: two chests in one set and only one chest in the other. Then the experimenter asked the child which chest they wanted to open. The selected chest opened and revealed its contents.
Children saw eight trials. Three variables were counterbalanced between participants: the location of the singleton chest in the first trial (L/R), the side where coins were hidden first in the first trial (L/R), and the positioning of the coin in the pair (LRRLRLLR / RLLRLRRL). The side of the singleton chest (L/R) and the side occluded first (L/R) swapped every trial.
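The counterbalancing scheme just described can be sketched in a few lines. This is an illustration only, not the software used to run the study: the function and field names are hypothetical, and reading each letter of the coin sequence as the position (left or right) of the coin within the pair on that trial is an assumption.

```python
from itertools import product

def flip(side):
    """Return the opposite side label."""
    return "R" if side == "L" else "L"

def study3_schedule(singleton_first, occluded_first, coin_sequence):
    """Build a per-trial schedule for one counterbalancing condition.

    singleton_first: side of the singleton chest on trial 1 ("L" or "R")
    occluded_first:  side occluded first on trial 1 ("L" or "R")
    coin_sequence:   one letter per trial, e.g. "LRRLRLLR" (assumed to give
                     the coin's position within the pair)
    """
    trials = []
    singleton, first = singleton_first, occluded_first
    for number, coin in enumerate(coin_sequence, start=1):
        trials.append({
            "trial": number,
            "singleton_side": singleton,
            "occluded_first": first,
            "coin_in_pair": coin,
        })
        # Both sides swap on every trial, as stated in the procedure.
        singleton, first = flip(singleton), flip(first)
    return trials

# Three two-level variables give eight counterbalancing conditions.
conditions = list(product("LR", "LR", ["LRRLRLLR", "RLLRLRRL"]))

if __name__ == "__main__":
    for row in study3_schedule(*conditions[0]):
        print(row)
```

Crossing the three variables yields eight conditions; how participants were assigned to those conditions is not stated in the text and is not modeled here.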

Statistical methods
Data were analyzed with Bayesian random intercept GLMMs using the default weakly informative priors. The grouping variable in every model was participant ID. All models were dummy coded. Below, response and predictor variables are specified when each model is introduced.
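As a reading aid, a random intercept logistic GLMM of this kind can be written as below. The notation is introduced here for illustration and is an assumption, not the authors' own: y_ij is the response of participant j on trial i (1 = wise decision), the x_k are dummy-coded predictors, and u_j is the participant-level intercept, taken to be normally distributed as is usual for such models. The priors are left out because the text specifies only that they were default weakly informative priors.

```latex
\operatorname{logit}\,\Pr(y_{ij} = 1) = \beta_0 + \sum_{k} \beta_k \, x_{k,ij} + u_j,
\qquad u_j \sim \mathcal{N}\!\left(0, \sigma_u^2\right)
```

Under dummy coding, the intercept is the log odds of a wise decision in the reference level, and each coefficient is a log odds ratio relative to that level.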

Statistical Background
Binary logistic models estimate the population mean log odds of a 1 response for each combination of predictors. In a Bayesian binary logistic model, the model estimate is not a point, but rather a distribution that captures the appropriate uncertainty about what the population mean is, given the data and the prior. Point estimates are typically provided by a measure of central tendency for that distribution. We have chosen to report the median as a measure of central tendency for all distributions, as the median tends to be more robust than the mean and maximum a posteriori (5). The maximum a posteriori is the continuous analog of the mode.
In Figure 2B (main text) posteriors are visualized as densities, the continuous analog of a histogram. The axes have been flipped so that the x-axis displays density (the continuous analog of count). The y-axis is a range of hypotheses regarding the probability of a wise decision. The area enclosed by the curve is 1. To evaluate a set of hypotheses about the probability of some event, e.g. the probability that children pick wisely more than 67% of the time on the throw away tasks, one calculates the proportion of the area enclosed by the curve associated with that set of hypotheses. In the example at hand, we would calculate the proportion of the area enclosed by the curve that is above .67 on the y-axis. To evaluate the probability of a point hypothesis, a Region of Practical Equivalence (ROPE) is defined around that hypothesis. This interval should be small enough that all points within that range are, for any practical purposes, equivalent to the hypothesis of interest. For example, to evaluate the probability that children pick the target cup 50% of the time on Pick 1 of 2, we will calculate the proportion of the area enclosed by the curve associated with the interval [.49, .51]. To evaluate the relative probability of this hypothesis, we divided the entire hypothesis space into 1000 discrete hypotheses, and calculated the probability of a ROPE of the same size around each of those 1000 hypotheses. We can then rank these ROPEs by their probability, and measure the relative probability of a given hypothesis by its rank: high ranking ROPEs have high relative probability, and low ranking ROPEs have low relative probability.
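The ROPE procedure described above might be sketched as follows, working from posterior draws on the probability scale. This is a minimal illustration under stated assumptions: the function names are hypothetical, the draws are simulated stand-ins rather than the study's posterior, and the 1000 candidate hypotheses are assumed to be equally spaced over [0, 1].

```python
import numpy as np

def rope_probability(samples, center, half_width=0.01):
    """Share of posterior draws inside [center - half_width, center + half_width]."""
    samples = np.asarray(samples, dtype=float)
    lower, upper = center - half_width, center + half_width
    return float(np.mean((samples >= lower) & (samples <= upper)))

def rope_rank(samples, center, half_width=0.01, n_hypotheses=1000):
    """Fraction of same-width candidate ROPEs that are less probable
    than the ROPE around `center`."""
    grid = np.linspace(0.0, 1.0, n_hypotheses)  # assumed spacing of hypotheses
    candidates = np.array(
        [rope_probability(samples, c, half_width) for c in grid]
    )
    target = rope_probability(samples, center, half_width)
    return float(np.mean(candidates < target))

# Simulated draws standing in for a posterior over the probability of a
# wise decision; these numbers are illustrative only.
rng = np.random.default_rng(0)
draws = rng.beta(52, 50, size=20_000)

if __name__ == "__main__":
    print("P(ROPE around .50):", rope_probability(draws, 0.50))
    print("rank of that ROPE:", rope_rank(draws, 0.50))
```

A rank close to 1 means that the interval around the hypothesis of interest is more probable than nearly all other intervals of the same width.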
When the probability of an outcome is the same in two trial types (for example, if the probability of picking the singleton chest is .60 in both Pick 1 of 3 and Pick 1 of 2), then the odds of that outcome are also the same in both trial types (in this case, the odds of picking the singleton chest are 1.5 in both trial types). The ratio of those two odds is 1, and the log of that odds ratio is 0: no difference. A positive log odds ratio indicates greater odds in the numerator of the ratio than in the denominator. For example, if the probability of throwing away from the pair in Throw Away (Study 2) is .9 (odds = 9), and the probability of picking the singleton in Pick 1 of 3 is .6 (odds = 1.5), then the log of the odds ratio 9 / 1.5 is positive (in this example, it is about 1.79). Similarly, a negative log odds ratio indicates greater odds in the denominator. We will report contrasts as log odds ratios. We will always specify the numerator and denominator, e.g., "median log OR, Throw Away (Study 2) / Pick 1 of 3: 1.79". For all contrasts reported in this paper, a log odds ratio of ±0.52 can be considered a small effect, ±1.24 is medium, and ±1.90 is large (Chen et al. 2010).
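The arithmetic in this paragraph can be checked with a short sketch. The helper names are hypothetical, and treating each rule-of-thumb value as the lower bound of its category is an assumption about how the cut-offs are applied.

```python
import math

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def log_odds_ratio(p_numerator, p_denominator):
    """Natural log of the ratio of two odds."""
    return math.log(odds(p_numerator) / odds(p_denominator))

def effect_label(log_or):
    """Label an effect using the cut-offs quoted in the text
    (0.52 small, 1.24 medium, 1.90 large), applied as lower bounds."""
    size = abs(log_or)
    if size >= 1.90:
        return "large"
    if size >= 1.24:
        return "medium"
    if size >= 0.52:
        return "small"
    return "below small"

if __name__ == "__main__":
    # Worked example from the text: .9 versus .6.
    value = log_odds_ratio(0.9, 0.6)
    print(round(value, 2), effect_label(value))  # 1.79 medium
```

Equal probabilities give a log odds ratio of 0, and swapping numerator and denominator flips the sign without changing the magnitude.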

Comparisons to chance
The main model predicted the probability of a wise decision from trial type, a factor with four levels: Pick 1 of 3 (Study 1), Throw Away (Study 2), Throw Away (Study 3), and Pick 1 of 2 (Study 3). For our first analysis we compared the estimated probability of making a wise decision in each trial type to chance. Chance was established by dividing the number of target cups by the total number of cups. We calculated the proportion of the posterior that was greater than these chance values in each trial type. In Pick 1 of 3 (Study 1), the entire posterior was greater than .33.
In Throw Away (Study 2), the entire posterior was greater than .67. In Throw Away (Study 3), 99.98% of the posterior was above .67. There are two important conclusions from these results.
First, children are not merely picking chests at random in any of these three trial types. Second, these results speak against the hypothesis that children deploy the low level strategies under discussion, as this hypothesis predicts that performance will be worse than chance on the Throw Away trial types.
In contrast with the first three trial types, in Pick 1 of 2 (Study 3), only 57.49% of the posterior was above .50, which is chance on this trial type (since there are only two cups to choose between). The median of this distribution was .51, 95% CI [.41, .62]. The probability that the population mean is .50 was evaluated by taking the interval [.49, .51] as a Region of Practical Equivalence (ROPE). The probability of this region was .148. By comparison, the probability of the ROPE [.50, .52] around the median, i.e., one of the most likely intervals of that width, is also .148. We divided the entire hypothesis space into 1000 point hypotheses, and calculated the probability of a ROPE of the same size around each of those 1000 hypotheses. We found that the interval [.49, .51] had higher probability than 99.5% of these intervals. Indeed, of the hypotheses within the 95% CI (that is, among the hypotheses that we would not reject on a frequentist analysis), the hypothesis that the population mean was .50 had higher probability than 99% of hypotheses.
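The chance levels used in these comparisons follow directly from the rule stated above (number of target cups divided by total number of cups). The sketch below is illustrative: the dictionary and function names are hypothetical, and the posterior draws would come from the fitted model, which is not reproduced here.

```python
from fractions import Fraction
import numpy as np

# (number of target containers, total containers) for each trial type.
TRIAL_TYPES = {
    "Pick 1 of 3 (Study 1)": (1, 3),
    "Throw Away (Study 2)": (2, 3),
    "Throw Away (Study 3)": (2, 3),
    "Pick 1 of 2 (Study 3)": (1, 2),
}

def chance_level(n_targets, n_options):
    """Chance as targets divided by options, kept exact as a fraction."""
    return Fraction(n_targets, n_options)

def share_above(samples, threshold):
    """Proportion of posterior draws strictly above a threshold."""
    return float(np.mean(np.asarray(samples, dtype=float) > float(threshold)))

if __name__ == "__main__":
    for name, (targets, options) in TRIAL_TYPES.items():
        print(name, chance_level(targets, options))
```

With real posterior draws on the probability scale, `share_above(draws, chance_level(1, 2))` would give the kind of quantity reported above for Pick 1 of 2.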
In the first 3 trial types, we can have high confidence that the population mean is greater than chance. By contrast, in the last trial type, the hypothesis that the population mean is .50 (chance) is one of the highest probability hypotheses. Of course, a population mean of .50 could come about in many different ways; for example, children could be picking chests at random (i.e., chance behavior), deploying minimal representations of possibility, or deploying the low level strategies under discussion. But children are not picking chests at random in any of the other trial types, even the same children in Study 3. Also, they are not deploying low level strategies in either of the Throw Away trial types. The full pattern of data prefers the hypothesis that children deploy minimal representations of possibility.
We now present three additional analyses. The first two differentiate among the three hypotheses depicted in Figure 2A, hypotheses that might explain the non-random behavior (60% wise choices) on Pick 1 of 3 tasks, including Study 1. First, we evaluate the differential predictions the three hypotheses make about the relative probabilities of wise decisions across trial types.
Second, we turn to the quantitative predictions concerning probability of wise decisions. Third, we analyze the distribution of individual participants' proportion wise decisions in Study 1. The observed mean performance rate among 3-year-olds over many Pick 1 of 3 studies is around 60%, and there are many ways to arrive at this mean. We analyze the proportion of correct responses across participants. We test whether the observed distribution is much more likely generated from an 80%/20% mixture of children who deploy minimal representations of possibility/children who deploy possibility concepts, respectively, or from a 60%/40% mixture of children who guess at random/children who deploy possibility concepts, respectively.

I. Different Hypotheses Make Different Predictions about Relative Probabilities
With respect to the relative probabilities of wise decisions, the hypothesis that most 3-year-olds deploy minimal representations of possibility predicts that performance on the Pick 1 trial types will be worse than performance on the Throw Away trial types (Figure 2A). The other two hypotheses predict no differences across the four levels of the predictor (Figure 2A). This noise would yield worse performance on Pick 1 of 3 than on Throw Away (because there are two distractors vs one, respectively). Another auxiliary assumption that might generate the observed data is that despite having possibility concepts and appreciating all the possibilities, many 3-year-olds guess entirely at random (see also section III below). Both of these hypotheses predict worse performance on Pick 1 of 3, where chance is 33%, than on the Throw Away trial types, where chance is 67%.
The results from Pick 1 of 2 speak against both of these auxiliary assumptions. Each predicts that performance should be better on Pick 1 of 2 than on Pick 1 of 3, as there are two distractors in Pick 1 of 3 and only one distractor in Pick 1 of 2, and chance is lower in Pick 1 of 3 (33%) than in Pick 1 of 2 (50%). Contrary to the prediction that performance on Pick 1 of 2 would be better than that on Pick 1 of 3, it is most likely that performance is slightly worse (median log OR, Pick 1 of 2 / Pick 1 of 3: -0.43, 95% CI [-1.16, 0.20]). This log odds ratio does not reach the rule-of-thumb cut-off for a small effect, but more importantly, it is in the wrong direction. The probability that children are even slightly better on Pick 1 of 2 than on Pick 1 of 3 (i.e., a log odds ratio of 0.52 or greater) is only .003.

II. Quantitative Predictions of the Minimal Representation of Possibility Hypothesis
The above analyses provide strong warrant to rule out the hypotheses that children deploy the low-level strategies under discussion and that they primarily deploy possibility concepts in this task. We turn now to the quantitative predictions of the hypothesis that children deploy minimal representations of possibility (Figure 2A). Children who deploy minimal representations of possibility should always pick wisely in the Throw Away trial types, but should only pick wisely half of the time in the Pick 1 of 3 and Pick 1 of 2 trial types. Three of these predictions are not precisely borne out. In Pick 1 of 3 the modeled probability of choosing wisely was 61%, not 50%. In Throw Away (Study 2) the modeled probability was 89%, not 100%. In Throw Away (Study 3) the modeled probability was 81%, not 100%. We first discuss these three departures, and then discuss why no such departure is observed in Pick 1 of 2.

Performance on Throw Away tasks is not 100%
Performance on the Throw Away trials was not 100%. In Study 2 it was 89%; in Study 3 it was 81%. Of course, some noise is inevitable in studies with 3-year-olds. Sources of noise include momentary inattention, refusal to play the game the experimenter has established, and many others. Moreover, there is some pragmatic oddness to the Throw Away task in Study 3. In contrast to Study 2, where one gets the contents of all the chests that remain after one is thrown away, throwing a chest away has no purpose in Study 3. The participant would get the same result if they simply opened the desired chest. Though the difference between Throw Away (Study 2) and Throw Away (Study 3) is small, and the 95% CI includes 0 (median log OR, Throw Away (Study 3) / Throw Away (Study 2): -0.62, 95% CI [-1.36, 0.07]), there may be some small decrease in performance that is not due entirely to noise.

Performance on Pick 1 of 3 is not 50%
Three-year-olds pick wisely 61% of the time on Pick 1 of 3 (Study 1), not 50% as depicted in Figure 2A. This departure was expected, given existing data. One explanation for this finding is that minimal representations of possibility underlie the performance of most 3-year-olds, and that the construction of possibility concepts begins between ages 3 and 4. If this is correct, then we might expect older 2-year-olds to exhibit the predicted 50% level of responding on Pick 1 of 3. We might also expect the difference between older 2-year-olds and 3-year-olds to be small. To assess this, we assembled data from existing Pick 1 of 3 studies with older 2-year-olds and 3-year-olds. To prefigure the results: We found that the data are highly replicable within both age groups across all existing studies, that older 2-year-olds pick wisely about half of the time, To test this, we fit a model predicting the probability of a wise decision from age group alone. For Performance was better than chance (probability > .33 = .998). Since the 95% CI includes .5, the hypothesis that the population mean for older 2-year-olds is .5 cannot be ruled out. For a more nuanced analysis of the probability that older 2-year-olds pick the target 50% of the time, we defined a ROPE around .5 as the interval [.49, .51]. The probability of this interval is .14. For comparison, the probability of the ROPE [.46, .48] around the median (one of the highest probability hypotheses) is .16. We calculated similar ROPEs for 1000 hypotheses divided uniformly over the entire hypothesis space. The probability of the interval [.49, .51] was higher than 94% of these intervals. More conservatively, we compared the hypotheses that were inside the 95% CI (i.e., the hypotheses that a frequentist analysis cannot distinguish between). We found that the probability of the interval [.49, .51] was higher than the probability of 69% of these intervals.
Not only can the hypothesis that older 2-year-olds pick the target 50% of the time not be ruled out; given the current data and our priors, this hypothesis is one of the highest probability hypotheses, as predicted if almost all older 2-year-olds deploy minimal representations of possibility.
Moreover, the difference between older 2-year-olds and 3-year-olds was not large. The most likely effect of age group was small (median log OR, 3-year-olds / older 2-year-olds: 0.56, 95% CI [0.10, 1.03]). It is likely that there is some improvement with age, as the 95% CI does not include 0. It is unlikely that the effect is even medium sized, as the 95% CI does not include 1.24. Thus, as predicted by the hypothesis that all older 2-year-olds and most 3-year-olds deploy minimal representations of possibility, and that a small handful of 3-year-olds deploy possibility concepts, the estimated population mean for older 2-year-olds is about .5, and 3-year-olds are only slightly better. Both of these predictions are supported by the data.

Pick 1 of 2 is almost exactly 50%
Finally, we discuss why 3-year-olds' performance in Pick 1 of 2 is almost exactly 50%, especially if a small proportion of 3-year-olds deploy possibility concepts, as demonstrated by the highly systematic 60% wise decisions in this age group on Pick 1 of 3. In fact, we preregistered the prediction that performance on Pick 1 of 2 would be about 60%, and the same as

III. Distribution of individual participants' proportion wise decisions on Study 1
In our explanation for why 3-year-olds' performance on Pick 1 of 3 is not 50%, we suggested that the population we sampled from is composed of two groups of children: one group who deploy minimal representations of possibility (thereby choosing wisely half of the time), and another group who deploy possibility concepts (thereby choosing wisely on every trial). An alternative hypothesis is that observed performance arises from a mixture of children who guess randomly (thereby choosing wisely a third of the time) and children who deploy possibility concepts. Notice that this alternative hypothesis is not one we have considered as of yet in this manuscript. On this hypothesis, children increasingly deploy modal concepts over the ages of 2 ½ to 4 or 5. Children who do not deploy modal concepts deploy neither minimal representations nor low level strategies, but rather choose randomly among the three containers. Throughout the paper we emphasized that chance-level performance is not observed on Pick 1 of 3. But we should also test whether the observed 60% performance rate on the Pick 1 of 3 measure is due to a mixture of children deploying modal concepts and children merely guessing among the visible cups.
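To make the two mixture hypotheses concrete, the sketch below computes the distribution each one predicts over the number of wise choices in three trials. It is an illustration rather than the analysis actually run: the function names are hypothetical, the mixture weights (80%/20% and 60%/40%) are the ones stated earlier in this document, and trials are assumed to be independent within a child so that each subgroup is binomial.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def mixture_pmf(n, components):
    """Predicted distribution over 0..n wise choices.

    components: list of (weight, p_wise) pairs, one per subgroup.
    """
    return [
        sum(weight * binomial_pmf(k, n, p) for weight, p in components)
        for k in range(n + 1)
    ]

def mean_proportion(pmf):
    """Expected proportion of wise choices under a predicted distribution."""
    n = len(pmf) - 1
    return sum(k * prob for k, prob in enumerate(pmf)) / n

# 80% minimal representations (p = .5) + 20% possibility concepts (p = 1).
minimal_mix = mixture_pmf(3, [(0.8, 0.5), (0.2, 1.0)])
# 60% random guessing (p = 1/3) + 40% possibility concepts (p = 1).
random_mix = mixture_pmf(3, [(0.6, 1 / 3), (0.4, 1.0)])

if __name__ == "__main__":
    print([round(x, 3) for x in minimal_mix])
    print([round(x, 3) for x in random_mix])
```

Both mixtures predict the same 60% mean, so only the shape of the distribution across participants can tell them apart, which is why the analysis looks at individual participants' proportions rather than the group mean.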
We repeated these analyses on all existing data sets of 3-year-olds in the 3-containers task to evaluate the most likely mixtures. We combined the existing datasets (1, 2), eliminating participants who did not see a full complement of 3 trials, yielding a sample of 46 3-year-olds. In the observed data, 9% chose the singleton 0 of 3 times, 30% 1 of 3 times, 35% 2 of 3 times, and Next, we present the analysis of existing data from older 2-year-olds (1, 2). After In every group, the data are unlikely under the hypothesis that the population is a mixture of children who guess randomly and others who deploy possibility concepts. The data are not unlikely under the hypothesis that the population is a mixture of children who deploy minimal representations of possibility and children who deploy possibility concepts.

Excluded Data
In the analysis of the Pick 1 of 2 data, we excluded trials where children had thrown away the singleton chest. This is because the question, "Did they choose the singleton?" is not defined when the singleton was thrown away. This was 21% of trials in Study 3, which raises an important concern. Throwing away the singleton might indicate a failure to understand the task, and perhaps a large part of the data that was actually analyzed is also contributed by children who did not understand the task. There are two reasons to doubt that this is so. First, performance is well above chance on both Throw Away measures. Second, children deploying minimal representations of possibility in Study 3 represent that the singleton contains a coin and which chest from the pair contains a coin. If these beliefs were true, there would be nothing irrational about throwing away the singleton, as one could simply pick the remaining chest that holds a coin in the second phase. In Study 2, in contrast, after they throw away one of the chests, they get all the remaining chests. This provides motivation to throw away an empty chest, and they throw away the singleton chest less than in Study 3 (median log OR, Throw Away (Study 3) / Throw Away (Study 2): -0.62, 95% CI [-1.36, 0.07]; see Section II above).